Algorithms for Molecular Biology — Latest Matching Preprints

1

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 0.1%

3.3%

Show abstract

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures have to be determined minimizing the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjancencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and practical perspective. Firstly, this simplification means that not the full state space is searched for a solution, but rather only the subset of genomes with the precomputed putative adjacencies, meaning an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, when gene annotation is often unreliable, especially in regions with a lot of rearrangement. Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to be able to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost to indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results, that allow us to reduce the solution space that the ILP searches without sacrificing optimality. We show that this makes the problem tractable for the case of small and recently related genomes, first on simulated genomes, and then on a set of pathogenic plasmids which represent a realistic use case for the method.

2

Lossless compression of k-mer matrices enabling random row access

Regnier, A.; Lemane, T.; Bellenous, S.; Chikhi, R.; Peterlongo, P.

2026-07-08 bioinformatics 10.64898/2026.07.03.736306 medRxiv

Top 0.1%

1.9%

Show abstract

Genomic search engines such as Logan-Search index petabytes of sequencing data as large binary matrices, called k-mer matrices, where each row encodes the presence of a k-mer across thousands to millions of genomic samples. Logan-Search contains a petabyte of binary matrices, and storing them is expensive, yet compression must not prevent fast random access to any matrix row at query time. We present kmcomp, a lossless compression method for k-mer matrices that satisfies these competing requirements. Block compression partitions the matrix into fixed-size row blocks, each compressed independently; block start positions are stored in an Elias-Fano encoded array, enabling O(1) random access to any block. To improve compressibility without introducing additional decompression steps, we introduce the {pi}-compression: a column reordering that groups similar samples together by solving the Traveling Salesman Problem via a nearest-neighbor heuristic. We accelerate this heuristic with a novel variant of the vantage-point tree, the masked vp-tree, which dynamically prunes nearest-neighbor search space. On three (meta)genomic datasets, kmcomp achieves compression ratios of 1.3 to 5.4; {pi}-compression further improves these to 1.5 to 51.3. Applied to the Logan-Search petabyte-scale index, compression reduces storage by approximately half, and {pi}-compression adds a further 13% gain. Query overhead remains modest: queries of hundreds of nucleotides incur an absolute latency increase of {approx} 100 ms, and highly compressed indexes can match uncompressed query times thanks to reduced disk reads.

3

Beyond infinite sites: Generalized ABBA-BABA statistic for deeper phylogenies

Zhang, C.; Nielsen, R.

2026-07-08 bioinformatics 10.64898/2026.07.06.736715 medRxiv

Top 0.1%

1.7%

Show abstract

The Patterson's D statistic detects gene flow from ABBA-BABA site patterns, but its biallelic site patterns fail under deeper divergences where multiple hits cause false positives. We propose two extensions, D+ and D*. Both incorporate multiallelic site patterns to reduce saturation bias under JC and F84 model. Simulations show that D+ and D* both remain correctly null under all conditions and detect gene flow effectively, with distinct advantages: D+ guarantees non-negativity of the denominator, while D* provides greater robustness when mutation rates vary across genomic regions. The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER.

4

Binary search and and set operations on compacted k-mer lists

Dufresne, Y.; Andreace, F.

2026-07-03 bioinformatics 10.64898/2026.06.29.735436 medRxiv

Top 0.1%

1.3%

Show abstract

Sorted lists of elements are particularly good for computing set operations. A single scan of the two lists is sufficient to materialize or count the results of the union, intersection, difference, and xor operators. In bioinformatics, only a few tools are designed to perform these operations on k-mers. A fast tool like KMC allows set operations at the cost of storing individual k-mers. In this paper, we introduce a novel way to represent sorted k-mers as a collection of recomposed super-k-mer sorted lists. We introduce the concept of virtual super-k-mer and show how to construct, query and perform set operations on sorted lists of virtual super-k-mers. In the implementation sklib, we demonstrate high throughput of the data structure for construction and set operations, while remaining competitive in query capabilities, within a controlled memory footprint (2-5x decrease in bits/element compared to KMC).

5

Navigating the pangenome coordinate system with Shredtools

Shivakumar, V. S.; Langmead, B.

2026-07-08 bioinformatics 10.64898/2026.07.03.736354 medRxiv

Top 0.2%

1.0%

Show abstract

Existing notions of pangenome coordinates rely on hard-to-compute multiple sequence alignments. On the other hand, pangenome-wide exact unique matches (multi-MUMs) can be computed efficiently, and represent conserved stretches of columns in the underlying MSA. We introduce Shredtools, which uses multi-MUMs as pangenome waypoints and allows for sophisticated queries in pangenome coordinates. Its primary query is extract, which takes an interval of one sequence and extracts the smallest window containing it that is syntenic pangenome-wide. Shredtools' extract query can extract a gene region from 476 human genomes in half a second. Other queries help to refine these results, by finding local exact matches to improve the density of multi-MUM coverage ("enhance") and by selectively discarding sequences to improve the precision of the syntenic region ("zoom"). The Shredtools web interface (available at https://vikshiv.github.io/shredtools) allows for client-side handling of extract queries with index queries handled via simple and fast HTTP Range requests, simplifying usage and enabling pangenome-scale discoveries.

6

synpact: accurate, memory-light PacBio HiFi read mapping via a hierarchy of locally-consistent syncmer blocks

Aydin, M. S.; Sahlin, K.

2026-07-02 bioinformatics 10.64898/2026.06.28.735066 medRxiv

Top 0.2%

0.8%

Show abstract

Motivation: Mapping PacBio HiFi reads is a routine task and serves as a central step in many bioinformatics analyses. However, the most accurate long-read mappers have a high memory consumption and are slow. Some light-weight mappers have been proposed for faster runtime, but their accuracy is not comparable to state-of-the-art mappers. With the increasing number of available reference sequences, memory-efficient and fast methods for read mapping without the large accuracy drop are desired. A general trade-off with seed-chain-extend mappers is selecting a single, fixed seed size, which forces a compromise between sensitivity and specificity. Results: We present synpact, a long-read mapper that uses several seed sizes (a hierarchy) constructed with Locally Consistent Parsing (LCP) over syncmers. A read is mapped by querying for matches at different levels, followed by sliding window voting. By storing only the coarse upper levels rather than the full hierarchy, the index holds several times fewer entries, while still handling errors by falling back from coarser to finer stored levels at query time. We benchmark synpact against popular long-read mappers on four genomes and different read lengths. For simulated PacBio HiFi data, synpact matches or approaches minimap2 accuracy with higher precision in most cases, while using roughly 5-13 times less peak memory (e.g., about 0.8GB vs. 10.7GB on human) and mapping faster on large or repetitive genomes (e.g., about 10 to 13 times faster than minimap2 on rye). On real HiFi reads synpact has high concordance with minimap2 across the four genomes, as opposed to the other lightweight long-read mappers. Availability and Implementation: synpact is written in Rust and is available at https://github.com/mahmudsami/synpact

7

Extended t-cores for the de novo identification of transposable elements and other inexact repeats from short read RNAseq data

Darmon, S.; Mary, A.; Lacroix, V.

2026-07-10 bioinformatics 10.64898/2026.07.06.736737 medRxiv

Top 0.2%

0.6%

Show abstract

Transcribed repeats represent a major challenge in the de novo assembly of transcriptomes from short RNA-seq reads. Young transposable elements (TEs) and other inexact repeats create dense and ambiguous regions in the assembly graph, preventing the correct assembly of transcripts. In this paper, we introduce a fully de novo method based on the discovery of dense regions in the compacted De Bruijn graph (DBG) to identify such repeats directly from short reads RNA-seq data, without requiring a reference genome or repeat database. Our approach defines the extended t-cores, subgraphs of the DBG that capture the complex topology induced by highly expressed inexact repeats appearing in RNA-seq reads. Independently of its interest for transcriptome assembly, the proposed method appears to be effective for the de novo identification of repeats in transcriptomes. After classifying cores using sequence-based motifs to distinguish simple repeats from potential TEs, we demonstrate its potential for the de novo discovery of transposable elements. We validate the approach on a Mus musculus dataset using expressed TE consensus sequences, showing that extended t-cores correspond to known expressed TE families. We also illustrate its de novo discovery potential on a non-model species, Canis lupus familiaris, where the method was also able to recover known transposable elements.

8

EDTA v2: enabling scalable TE annotation in animal genomes

Ou, S.; Lu, T.; Nguyen, H.; Gerhardt, K.; Fang, N. F.; Rashid, U.; Guhlin, J.; Dainat, J.; Bao, Z.; Bayer, P. E.; Na, Y.; Benson, C.

2026-07-06 genomics 10.64898/2026.07.01.735963 medRxiv

Top 0.2%

0.5%

Show abstract

The Extensive de-novo TE Annotator (EDTA) automates transposable element annotation in plant genomes but lacks direct LINE/SINE detection, limiting its applicability to animal genomes. We present EDTA v2, which integrates LINE and SINE detection, completely rewrites TIR-Learner for deployability and scalability, and accelerates structural detectors by up to two orders of magnitude. Tested in 30 animal genomes from the Vertebrate Genomes Project Phase I, EDTA v2 bridges the non-LTR detection gap that has prevented automated TE annotation in animals.

9

fastQpick: scalable bootstrap and subsampling of FASTQ reads

Rich, J.; Pachter, L.

2026-06-24 bioinformatics 10.64898/2026.06.23.734068 medRxiv

Top 0.3%

0.4%

Show abstract

Summary: fastQpick is a command-line tool and Python library for sampling FASTQ reads with replacement. Sampling with replacement turns a single FASTQ file into an arbitrary number of bootstrap replicates, which enables uncertainty quantification and statistical analysis at the level of raw reads. This process answers questions such as how much an abundance estimate would change if the library were resequenced, or whether a low-abundance call is robust to the particular reads that were sequenced. fastQpick works efficiently on large libraries by streaming files in two passes by default: first to count reads and create a hash-based counter, and then to write the sample. It generates a full-size bootstrap replicate of a 500-million-read library in under 30 minutes with 9.4 GB of peak memory, with a low-memory mode that reduces the peak to 1.4 GB. A single-pass mode draws samples in a single read through the file, using O(1) working memory and producing an output size that is exact in expectation but not fixed. In a real yeast RNA-seq experiment, bootstrap replicates generated by fastQpick recover the sampling uncertainty of transcript abundance estimates, matching the analytic multinomial standard errors to within a few percent. Availability and implementationfastQpick is open source and freely available under the MIT license on GitHub at https://github.com/pachterlab/fastQpick and on PyPI (pip install fastQpick).

10

GeneBench-Pro: Evaluating Multistage Statistical Reasoning\\in Genomics, Quantitative Biology, and Translational Biomedicine

Li, J. H.; Ho, A. J.

2026-06-30 bioinformatics 10.64898/2026.06.29.735386 medRxiv

Top 0.3%

0.4%

Show abstract

We introduce GeneBench-Pro, an expanded and improved version of GeneBench that comprises harder problems across a wider breadth of domains. GeneBench-Pro is a benchmark for AI agents performing realistic multi-stage scientific analyses in genomics, quantitative biology, and translational biomedicine which seeks to capture the complexity of real-world problems that computational life scientists face when tasked with producing a conclusion upon which a downstream scientific or translational decision is contingent. The benchmark comprises 129 evaluations targeting quantities of direct practical relevance across 10 primary domains and 21 terminal subdomains, with a genomics-centered core. Similarly to GeneBench, each problem provides the agent with brief context, a target estimand, and minimal guidance otherwise; the agent must then navigate multiple dependent decision points; i.e., substantive inferential forks where a plausible wrong choice changes the downstream analysis, to identify and execute the correct analysis workflow and arrive at the correct answer. Relative to GeneBench, GeneBench-Pro adds 29 new problems, drops three, and introduces significantly redesigned versions of 54 of the remaining 100 overlapping problems. 82 of the 129 problems were reviewed by external domain experts, whose findings led to prompt/data modifications and redesign of those problems whose targets were not sufficiently identifiable. Ten externally reviewed problems are released publicly, 50 held-out problems were provided to Artificial Analysis for independent third-party model benchmarking, and the remainder are retained as an internal holdout. In evaluations over the full 129-problem suite, GPT-5.6 Sol reaches an eval-level pass rate of 28.7% at the max reasoning level, and GPT-5.6 Sol Pro reaches 31.5% in separately reported GPT Pro runs. GPT-5.5 reaches 12.0%, GPT-5.4 reaches 8.9%, and the strongest non-GPT baseline, Claude Opus 4.8, reaches 16.0%. As with GeneBench, models often complete substantial portions of the workflow but exhibit a consistent gap between noticing and acting by identifying local diagnostic signals but failing to propagate the implications to the corresponding analysis decision. As a result, models often select wrong estimators or persist on initially plausible but incorrect analysis paths. GeneBench-Pro therefore measures an emerging capability of long-horizon biological reasoning that remains unreliable.

11

Tiny Subsamples and Upsampling Tame Big Data Evolutionary Analysis in Phylogenomics

Kumar, S.; Tamura, K.; Sharma, S.

2026-06-23 evolutionary biology 10.64898/2026.06.21.733599 medRxiv

Top 0.3%

0.3%

Show abstract

Long runtime, high memory demands, and reliance on high-performance computing increasingly limit the evolutionary analysis of long phylogenomic datasets. We review a scalable framework based on phylogenomic subsampling and upsampling (PSU), in which many small subsamples of sites from a long concatenated sequence alignment are extended by upsampling prior to inference, and the resulting analyses are then aggregated to obtain stable evolutionary estimates. PSU exploits a useful distinction between the computational burden and the inferential power of statistical methods in molecular phylogenetics: computational cost is strongly influenced by the number of distinct site patterns in the concatenated alignment, whereas statistical power depends primarily on the amount of evolutionary information represented by sites and substitutions. By reducing the former while restoring the latter through upsampling, PSU can approximate many full-data analyses at substantially lower computational cost. Evidence from simulated and empirical datasets shows that PSU can accurately estimate bootstrap support values, select optimal substitution models, test evolutionary hypotheses, and infer branch lengths, divergence times, and associated uncertainty measures, while often reducing runtime and memory requirements by orders of magnitude. The same subsampling-upsampling-aggregation principle underlies all of these applications. PSU also provides distributions of inferred clade support across independent subsamples, enabling detection of concordant and conflicting phylogenetic signals that may remain hidden in conventional concatenated phylogenomic analyses. Adaptive procedures for selecting the subsample size, the number of subsamples, and the number of upsampling replicates make the framework practical across diverse datasets. We suggest that PSU is a general strategy for scalable phylogenomic inference across a broad range of statistical methods. By enabling rigorous analyses of genome-scale alignments on standard computing hardware, PSU expands access to computationally intensive evolutionary methods while reducing the environmental and infrastructural costs of big-data phylogenomics.

12

Homology-aware cross-validation strategies for generalization assessment in RNA structure prediction

Bugnon, L.; Kulemeyer, G.; Gerard, M.; Di Persia, L.; Stegmayer, G.; Milone, D. H.

2026-06-29 bioinformatics 10.64898/2026.06.28.735057 medRxiv

Top 0.3%

0.3%

Show abstract

RNA secondary structure prediction is a fundamental challenge in bioinformatics, essential for understanding the functional roles of non-coding RNAs. Recently, deep learning models have transformed the field with impressive results, leading to critical discussions regarding the validity of current cross-validation strategies. On the one hand, traditional random partitioning yields overop-timistic results due to data leakage from uncontrolled homology. On the other hand, removing from the training set all sequences that exhibit even the slightest resemblance to the testing sequences penalizes learning-based methods by requiring generalization to completely out-of-distribution sequences. While it is very simple to remove sequences and retrain a machine learned model, it is very difficult to remove the experimental data used for parameter tuning and the sequences used for the development of classical thermodynamic methods. Thus, these methods often benefit from an implicit knowledge leakage. In this work we critically review existing cross-validation strategies for RNA secondary structure prediction: random splitting, clustering-based splitting, and leaving one RNA family out for testing. We analyze the advantages and limitations of each strategy, also expanding them towards the future directions to ensure fair comparisons across the full range of sequence similarities, with the same rigor for both classical and learning-based methods.

13

CoalMiner: a coalescent model generator for fastsimcoal2

Esplin-Stout, R.; Sethuraman, A.

2026-06-30 evolutionary biology 10.64898/2026.06.25.734618 medRxiv

Top 0.3%

0.3%

Show abstract

Demographic inference using the Site Frequency Spectrum (SFS) is often constrained by the number and complexity of models tested. Here we present a coalescent model generator called CoalMiner for use with fastsimcoal2. CoalMiner utilizes a decision tree framework to generate biologically plausible models, with user input dictating the number and ranges of demographic parameters and histories, which can then be plugged into the fastsimcoal2 pipeline. Using extensive simulations and empirical data, we show that CoalMiner is an effective helper tool to explore demographic model space. CoalMiner is written in Python and is freely available on GitHub: https://github.com/raywray/coalminer with numerous vignettes and tutorials.

14

Does ensembling improve feature attributions from sequence-to-activity models?

Maslova, A.; Libbrecht, M.

2026-07-13 bioinformatics 10.64898/2026.07.08.737315 medRxiv

Top 0.4%

0.3%

Show abstract

Sequence-to-activity models take as input DNA sequence and predict genomic activities such as transcription factor binding and gene expression. Applying explainable AI (xAI) methods such as DeepLIFT to these models has recently led to breakthroughs towards many genomic problems, including transcription factor binding grammar and predicting effects of genetic variants. However, there remains significant uncertainty about the reliability of sequence-to-activity interpretations. Thus, we need accurate probabilistic measures of confidence to distinguish reliable from unreliable interpretations. Towards this end, researchers have recently aimed to characterize variability across ensembles of S2A models. However, previous work has focused on using model ensembles to improve the model predictions themselves. Here, we aim to evaluate whether model ensembles can also be used to improve feature attributions from post-hoc xAI methods. We find that ensembling attributions from multiple models improves downstream applications, including identifying transcription factor motifs and predicting regulatory genetic variants. We show that forming an ensemble using Monte Carlo Dropout (MCDropout) gets near to, but does not match, the performance of training multiple models, at much less train-time computational cost.

15

Revisiting the base pair maximization approach for RNA secondary structure prediction with SQUARNA

Serdakov, M. D.; Bohdan, D. R.; Nikolaev, G. I.; Bujnicki, J. M.; Baulin, E. F.

2026-07-01 bioinformatics 10.64898/2026.06.30.735492 medRxiv

Top 0.4%

0.3%

Show abstract

Non-coding RNAs play diverse roles in a wide range of cellular processes, with their spatial structure being pivotal to their function. RNA secondary structure is a key determinant of its overall fold. Given the scarcity of experimentally determined RNA 3D structures, understanding secondary structure is vital for discerning RNA function. Currently, there is no universally effective solution for de novo RNA secondary structure prediction. Existing methods are becoming increasingly complex without marked improvements in accuracy and often overlook critical features such as pseudoknots and alternative folds. Here, we introduce SQUARNA, a new approach to de novo RNA secondary structure prediction that is suitable for both individual RNA analysis and large-scale structural searches. SQUARNA revisits the concept of base pair maximization and develops it into a stem maximization idea coupled with the widely used free energy minimization (MFE) framework. SQUARNA can predict alternative structures and handle pseudoknots of arbitrary complexity. Benchmarking shows that SQUARNA outperforms existing methods, including deep learning models, in both single-sequence and alignment-based RNA secondary structure prediction. SQUARNA seamlessly integrates sequence and alignment information with experimental data, such as residue reactivities obtained by chemical probing, as well as other structural restraints, including automated searches for Rfam database templates, G-quadruplex patterns, and protein-binding motifs. SQUARNA is available as a standalone tool at https://github.com/febos/SQUARNA and as a web server at https://larnal.imol.institute.

16

Generative continuous time model reveals epistatic signatures in protein evolution

Pagnani, A.; Barrat-Charlaix, P.

2026-07-10 bioinformatics 10.1101/2025.09.17.676821 medRxiv

Top 0.4%

0.3%

Show abstract

Protein evolution is fundamentally shaped by epistasis, where the effect of a mutation depends on the sequence context. As standard phylogenetic methods assume independently evolving sites, there is a need for more complex models based on accurate estimations of the fitness landscape. Good candidates are modern generative models -- such as the Potts model -- which successfully capture epistatic effects. However, recent work on generative evolutionary models usually use discrete time, making them difficult to integrate with the standard frameworks in evolutionary biology. We introduce a continuous-time sequence evolution model using the Gillespie algorithm and parameterized by a generative Potts model. This approach enables us to simulate realistic, family-specific evolutionary trajectories and allows for direct comparison with independent-site models. Surprisingly, we find that while epistasis significantly slows down evolution, it does not change the average evolutionary rates at individual sites. This is explained by the rate heterogeneity caused by context-dependence: we show that the rate at some positions varies between null to high values depending on the context, while other positions are essentially independent from the context. Finally, we show that epistasis leads to a systematic underestimation bias in the inference of evolutionary distance between sequences. Overall, our work provides a new tool for simulating realistic protein evolution and offers novel insights into the complex interplay between epistasis and evolutionary dynamics.

17

Benchmarking large language models for ACMG/AMP variant interpretation and variant calling

Corpas, M.

2026-07-05 genomics 10.64898/2026.06.30.735646 medRxiv

Top 0.4%

0.3%

Show abstract

Agentic large language models are increasingly used across the genomic workflow, from variant calling to clinical interpretation, yet they are evaluated by accuracy alone, a single figure that cannot say whether a system is safe or where in the workflow a failure originates. We present ClawBench, a framework that attributes each outcome to the architectural layer that produced it across both halves of the canonical pipeline. Two design choices remove the confounds that make agentic genomics hard to evaluate: a temporally blinded truth set, in which every scored ClinVar label first became available only after the training cutoff of every model tested, and a fail-closed evidence contract that blocks evidence circular with the truth label. We score validity, safety, provenance and reproducibility, not accuracy alone, under a constraint gradient that relocates correctness from a model's prior into executed, validated code. We show three things. First, dangerous misclassification is rare and model-invariant, a controlled precondition of the executed architecture rather than a frontier, while fabricated evidence is measurable and is neutralised by execution. Second, different variant classes are rate-limited by different layers: loss-of-function variants by the deterministic combiner threshold, and rare missense by evidence formation, where evidence acquisition is asymmetric and capped and strength assignment is a recoverable layer that naive strength-licensing prompts confound. Third, for variant calling the arms separate not on whether a model can plan a pipeline, which all do, but on trust properties, pinning, provenance, auditability and reproducibility, which climb monotonically toward validated execution; and a local open-weight model reproduces the safety result yet meets the structured-output and provenance contract far less often than frontier models, a conformance gap rather than a capability or safety gap. An end-to-end join attributes failures across the whole workflow, separating a missed call from a propagated genotype error from a correctly called but misinterpreted variant.

18

FORGE reveals an information spectrum encoded in RNA tertiary-structure geometry

Gow, L.; Li, J.; Tan, X.; Li, L.

2026-07-06 bioinformatics 10.64898/2026.07.05.736550 medRxiv

Top 0.4%

0.2%

Show abstract

Coarse RNA coordinate representations are increasingly used for inverse folding and structural annotation, but the biological information encoded in such representations is not well quantified. We introduce FORGE (Feature-engineered RNA Geometry Evaluation), a feature-engineering framework that extracts 935 interpretable descriptors from six backbone atoms and one glycosidic-anchor atom per residue. In a temporal test on 4,135 post-2025 PDB RNA chains, FORGE recovered 64.6\% of native nucleotides; a six-atom control without the glycosidic nitrogen retained 58.5\%, and abstaining from the least-confident half of calibration positions retained 94.4\% accuracy. The same representation predicted base-pair state (79.2\% accuracy), a RibonanzaNet-inferred DMS reactivity proxy ($R^2=0.329$) and protein-proximal context (AUC approximately 0.67). Native-decoy, OpenKnot and solved AI-designed pseudoknot tests further show that nucleotide identifiability, foldability and design score are distinct objectives. FORGE provides a reproducible audit layer for RNA structural interpretation.

19

Adaptive 2.5D base-pairing subgraph search detects RNA small-molecule binding sites

Nitchi, D.; Waldispuhl, J.; Oliver, C.

2026-07-10 bioinformatics 10.64898/2026.07.07.737024 medRxiv

Top 0.4%

0.2%

Show abstract

Ribo-LENS is a geometric deep-learning framework for detecting small-molecule binding sites in RNA structures. It is designed to exploit two properties of RNA base-pairing networks: their robustness to conformational fluctuation and the functional signatures they encode. By reasoning directly in the space of base-pairing subgraphs, Ribo-LENS assembles coherent binding sites, in contrast to methods that score residues independently. In extensive experiments, Ribo-LENS is competitive with, and often outperforms, large all-atom co-folding models (AlphaFold3, Chai-1), fine-tuned language models (GerNA-Bind), and structure-based tools (RNAsite), raising mean MCC to 0.380 (versus 0.321 for the state-of-the-art GerNA-Bind). It is strongly robust to apo/holo rearrangement, with its accuracy tracking the base-pairing graph (Spearman rho = 0.82 with binding-site graph edit distance) rather than backbone displacement (rho = -0.15 with RMSD), and depends far less on sequence homology than competing predictors. In an end-to-end, sequence-based virtual screen of the ROBIN assay (approximately 25,000 compounds), Ribo-LENS guides docking to a small predicted pocket, matching a blind all-atom cavity search (enrichment factors up to 6.1) at a fraction of the search cost; on two SARS-CoV-2 targets its predicted sites align with NMR chemical-shift perturbations. Ribo-LENS turns coarse base-pairing structure into a practical entry point for screening the vast, largely unexplored RNA target space.

20

Genomic Annotation Infrastructure (GAIn): Pipelines and Resource Repositories for Annotating Variants, Positions, and Regions

Cokol, M.; Chorbadjiev, L.; Lee, Y.-h.; Jamsandekar, M.; Gergova, I.; Todorov, I.; Iossifov, I.

2026-07-12 bioinformatics 10.64898/2026.07.08.737273 medRxiv

Top 0.4%

0.2%

Show abstract

Interpretation of genomic variants, positions, and regions depends on reliable annotation--adding evidence such as predicted effect, conservation, population frequency, and gene-level context--yet the underlying resources are numerous, versioned, and assembly-specific. We present the Genomic Annotation Infrastructure (GAIn), a platform that generates transparent, reproducible annotations via declarative pipelines that define annotation tasks as ordered lists of components, called annotators, that produce annotation attributes using genomic resources from Genomic Resource Repositories (GRRs). We provide two public GRRs: a main repository containing more than 250 heterogeneous genomic resources, and a separate GRR-ENCODE repository containing resources derived from thousands of ENCODE (Encyclopedia of DNA Elements) project experiments. Users can use the annotation pipelines we made available, author custom annotation pipelines, and execute annotation tasks with these pipelines via GAIns web and command-line interfaces. The web interface can be used without any setup, but it relies on shared computational infrastructure and imposes limits on the size of annotation tasks. The command-line interface requires setup but supports arbitrarily large annotation tasks through simple-to-use parallelization and offers a broader set of features. For example, command-line GAIn can be extended by using custom GRRs or creating custom annotators via its plugin architecture. In addition, GAIns re-annotation feature, which updates annotations as they evolve, substantially simplifies maintaining annotations in a large genomics analysis project. GAIns resource management, explicit versioning, and pipeline abstraction provide an auditable, maintainable, and efficient foundation for modern genomic annotation across reference assemblies and use cases.